Bookend: precise transcript reconstruction with end-guided assembly

We developed Bookend, a package for transcript assembly that incorporates data from different RNA-seq techniques, with a focus on identifying and utilizing RNA 5′ and 3′ ends. We demonstrate that correct identification of transcript start and end sites is essential for precise full-length transcript assembly. Utilization of end-labeled reads present in full-length single-cell RNA-seq datasets dramatically improves the precision of transcript assembly in single cells. Finally, we show that hybrid assembly across short-read, long-read, and end-capture RNA-seq datasets from Arabidopsis thaliana, as well as meta-assembly of RNA-seq from single mouse embryonic stem cells, can produce reference-quality end-to-end transcript annotations. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-022-02700-3.

label.single.fq Filepath for single-end output --min start [int] 7 Minimum allowed match to Start Tag --min end [int] 9 Minimum allowed match to End Tag --mismatch rate [0][1] 0.06 Permit up to this percent of mismatches in the tag match --minlen [int] 18 Discard trimmed reads shorter than this --minqual [float] 16 Discard any reads with a mean phred score lower than this --qualmask [float] 25 Set low-quality basecalls to N for more lenient matching --discard untrimmed False Do not keep reads without a trimmed tag --pseudomates False Pipe single reads to --out1 with an artificial mate pair in --out2 (overrides --single out) From one single-end FASTQ file or two paired-end FASTQ files, bookend label scans each read for a matching "Start Tag" and/or "End Tag", then outputs between one and three trimmed FASTQ files with the type and length of end labels appended to the read name as a " TAG=" suffix. All labeled reads are reoriented so that Mate 1 is sense to the original RNA strand. The Start Tag sequence can be provided with the -S argument, and should be the sense-stranded sequence of the oligo in cDNA upstream of 5 ends. For Smart-seq methods this is the Template Switching Oligo (TSO). The End Tag is the sequence of the 3 oligo used during reverse transcription, usually an oligo-d(T) primer. To trim tags of uncertain length, a "+" can be added after the last nucleotide, and this will be treated as an oligomer of indefinite length. For example, passing the argument -E TTTTTTTTTT+ will cause bookend label to trim and label all sequences of 10 or more T's/A's. For reads that contain a Unique Molecular Identifier (UMI) within a Start Tag or End Tag, the --umi argument specifies where to look for the UMI sequence, which will be appended as an additional " UMI=" suffix. For oligo-d(T) End Tags, it is good to provide at least part of the oligo sequence upstream of the T oligomer; this helps minimize the number of templated oligo-T/oligo-A sites that are falsely labeled as End Tags. If your files do not contain Illumina-encoded Phred quality scores, set --minqual -1 --qualmask -1. This is the main utility for generating End Labeled Read (ELR) files from aligned, name-sorted BAM/SAM files. This step can also filter out common sources of erroneous end labels, which is strongly recommended for accurate assemblies. Specifically, oligo-d(T) primers can bind to genome-templated A-rich sequences within the RNA molecule in addition to non-genome-templated poly(A) tails. Depending on the experiment and organism, this "oligo-dT mispriming" can be extremely common, resulting in many false positive End Tags. Various 5 -end sequencing techniques also suffer from known artifacts. Libraries that make use of Template Switching Oligos (TSO), including the Smart-seq and IsoSeq methods and some 5 -end sequencing protocols, are susceptible to template switching artifacts at regions that share sequence similarity to the TSO's 3 end. If a reference genome is provided to bookend elr, then it can perform artifact masking for both 5 and 3 End Tags.
Artifact masking depends on the sequences provided by --start seq and --end seq. These two sequences should be the expected untemplated sequence on the sense cDNA strand immediately upstream and downstream of the RNA sequence, respectively. Default parameters are optimized based on Smart-seq2 libraries in mouse and Arabidopsis. These include a very permissive mismatch rate (20%), and an oligo-dT mispriming site of up to 30 purines (IUPAC code R). If a long artifact masking sequence is provided, matching is performed up to the trimmed tag length present in the read's " TAG=" suffix.
When provided a reference genome, bookend label also performs Cap Inference on Start Tags. In 5 -end labeled sequencing libraries, the presence of a non-genome-templated 5 -terminal G between the trimmed oligo and the aligned sequence is evidence that this read possessed a 7 m G cap structure. These reads can be extremely valuable for distinguising genuine transcription start sites from the products of RNA degradation that occur both in vivo and in vitro.

The End Labeled Read file format
End Labeled Read (ELR) files are defined in two parts: a header that builds an index of reference chromosomes (#C) and read sources (#S), and a 7-column body with the following contents: ELCIGAR strings describe the position and labels of all aligned segments of a read, patterned off of the BAM/SAM format CIGAR strings but with additional End Label information. They are strings of Character/Number pairs with one trailing character ([CN ] x C), where C is a label and N is a numeric distance on the genome. Each label annotates the end of the associated span as one of the following: Splice junction donor A Splice junction acceptor .
Unspecified gap/end For example, the ELCIGAR for a 50bp paired-end read of a 185nt cDNA fragment with no introns or end labels would be ".50.85.50.". A full-length transcript with the ELCIGAR "S256D800A128D800A512E" has 3 exons of 256, 128, and 512nt, respectively, and 2 introns that are both 800nt.
Short indels and mismatches are not recorded in ELCIGAR strings, but the number of alignment errors within each exonic region (between adjacent SD, CD, AD, and AE pairs) is tallied. If this tally exceeds --error rate as a proportion of the number of aligned bases in that exon, then it is removed from the ELCIGAR and the surrounding exons are bridged with an unspecified gap (.. pair). This setting prevents the use of splice junctions from especially error-prone alignments, which can be common in some long-read sequencing protocols.
During assembly, the "weight" column will be used to determine read coverage depth for exonic frags and for end positions. It can either be a single float value (coverage), or a set of 3 floats separated by pipe symbols, indicating start tags-coverage-end tags. These values will only differ if the ELR feature is the product of assembly. If a Start Tag, Cap Tag, or End Tag is lowercase (s, c, e), bookend assemble will treat these ends as having a weight of 1 instead of the read weight. This can be beneficial for partial assembly (discussed in the next section) of sparsely-labeled samples. This function is a special case of end-guided assembly, in which assemblies are not penalized or filtered out if they are not end-to-end complete. This "partial assembly" is specifically designed for single-cell RNA-seq datasets, so that the reads from each cell can be condensed for efficient meta-assembly.
Partial assembly takes a single ELR file input and writes a single ELR file. By default all reads are reported with minimal filtering. All loci with multiple overlapping reads are assembled using the standard end-guided assembly algorithms, but no penalty is applied for incomplete starts/ends. These output files can then be used as input to bookend assemble. Assembling first on the single-cell level guarantees that exon chains from each cell remain coherent during the full-scale assembly. End-capture libraries can be condensed in a way that exclusively retains end clusters and their abundance using the argument --start or --end. For sparsely labeled samples, like short reads from full-length sequencing protocols, coverage depth of the condensed read will not be representative of of the Start Tag and End Tag abundance, so the --sparse argument writes end labels as lowercase to drop their priority during assembly. This is the main Bookend subcommand. From one or more sorted ELR files, bookend assemble packages reads into loci, then resolves each locus via end-guided assembly. The algorithms are more fully described as pseudocode in the manuscript appendix "Assembly Algorithms".
Assembly begins by reading through the input file(s) and packaging reads into discrete chunks. If a gap between adjacent aligned reads larger than --max gap exists, the current chunk of reads is processed and a new one is started. The argument --min proportion is used as a global "noise filter" and sets the threshold below which rare features should be filtered out. After filtering low-abundance splice junctions within the chunk, it can be further split into subchunks if gaps appear that were previously passed over by a filtered splice junction.
In addition to the global noise filter, the user has control over a number of other filters that can be modified to either retain or remove lower-confidence assemblies. --min cov, --min start, and --min end set the threshold of evidence required for a transcript's estimated abundance, Start Tags, and End Tags, respectively. Limits can also be imposed on the structure of the transcript with --min len to exclude very short transcripts and The user can also determine the characteristics of Start Tag and End Tag clustering. Both Transcription Start Sites (TSS) and Polyadenylation Sites (PAS) can occur in broad clusters, sometimes spanning hundreds of bases on the genome. The End Clustering algorithm treats Start Tags on the same strand within a distance --end cluster to belong to the same Start Cluster, and the same is done for End Tags. Additionally, the weight of Cap Tags (Start Tags with an upstream untemplated G) is multiplied by --cap bonus. To further minimize the chance of false positive TSS from RNA cleavage or degradation, a --cap filter is enforced for Start Clusters not at the extreme ends of a locus. Finally, passing the argument --require cap excludes even terminal Start Clusters if they do not pass the --cap filter.
Finally, Bookend can source information in the edge weights of its Overlap Graph, which reduces the chances of producing chimeric transcripts with making an assembly from multiple samples. The estimate coverage of each source in each sample can be written to a TSV file if the filepath is provided to --cov out. In some cases it may be beneficial to ignore source, like in a hybrid assembly that uses reads from multiple library types (in the manuscript this is demonstrated with short-read + long-read + 5 -end libraries). By default assembly is "source-naive"; --use source may improve assembly, but it can be very resource intensive if there are many input samples. If more than 50 samples are being combined, use bookend condense on each sample beforehand, then assemble the set of condensed assemblies. classify.tsv Path to write output TSV file --end buffer [int] 100 Distance between ends to consider a match --ref parent [list] mRNA transcript Line type(s) signifying a parent object --ref child [list] exon Line type(s) signifying a child object --parent attr gene [str] gene id (GTF/GFF) Attribute that stores gene id in parent objects --child attr gene [str] gene id (GTF/GFF) Attribute that stores gene id in child objects (exons) --parent attr transcript [list] transcript id (GTF/GFF) Attribute(s) storing transcript id in parent objects --child attr transcript [list] transcript id (GTF/GFF) Attribute(s) storing transcript id in child objects (exons) --bed gene delim [str] .

String that splits gene name from isoform number
This utility compares two annotation files; one acts as a "reference", and the other acts as "input". For each transcript model in the input annotation, the reference annotation is queried for the closest structural match. A hierarchy of classifications is used to define each input transcript: full match Exact exon chain, starts and ends closer than --end buffer exon match Exact exon chain, but ends do not match fusion Shares exons with 2 or more reference genes fragment Contained in a reference transcript, missing exon(s) isoform Closest match has incompatible exon chain intronic Fully contained in a reference intron (sense) antisense Only overlaps a reference transcript on antisense strand ambiguous Nonstranded transcript overlaps a reference transcript intergenic No reference overlap The output file is a 13-column TSV file with the following information: This utility compares takes one or more ELR files as input and generates a bedgraph file that summarizes read abundance by genomic position. By default, the entire aligned length of all reads will be "piled up". It can be more informative to examine only the 5 end of start-labeled reads or the 3 end of end-labeled reads. with the --type argument, you can choose to quantify only the 5 -or 3 -terminal positions of starts and ends, respectively.
-t E and -t 3P are identical; however, -t 5P counts all start-labeled reads regardless of the presence/absence of a cap, whereas -t C only counts caps and -t S only counts start tags without caps.
When counting ends, the behavior of --strand . changes. If no strand is specified for -t COV, it is assumed that read coverage is not strand-specific, and all reads will be piled up as positive values. However, if counting start tags or end tags, --strand . will add forward-stranded tags as positive values and reverse-stranded tags as negative values. This is a more compact representation than a separate bedgraph file for each strand, but caution! In the even of tags occupying the same genomic position on opposite strands, they will cancel each other out. bookend fasta converts an annotation from a set of genomic coordinates (stored as a GTF, GFF3, BED12, or ELR file) to a set of transcript sequences in FASTA format. Introns are removed and the reverse complement is written for reverse-stranded features. Combine more than one ELR file into a single position-sorted file with a uniform header. If the number of input files exceeds your open file resource limits, the files will be sent to a set of temp files with the --temp prefix, then these temp files will be combined. None GTF attribute to pass to the BED score column --gtf parent [list] transcript Line type(s) in GTF file that define a Parent object --gtf child [list] exon Line type(s) in GTF file that define a Child object --gff parent [list] mRNA transcript Line type(s) in GFF3 file that define a Parent object --gff child [list] exon Line type(s) in GFF3 file that define a Child object --child attr gene [str] gene id (GTF/GFF) Attribute that stores gene id in child objects (exons) --parent attr transcript [list] transcript id (GTF/GFF) Attribute(s) storing transcript id in parent objects --child attr transcript [list] transcript id (GTF/GFF) Attribute(s) storing transcript id in child objects (exons) --color code [file] None Tab-separated file of transcript types -> R,G,B colors --color key [str] None GTF/GFF3 attribute name to lookup transcript type

elr-sort
Converts a GTF/GFF3 file to BED12. This can be used, for example, to convert an annotation file into a format that can be used in assembly. It is also recommended to convert GTF or GFF3 annotations for bookend classify, since these file formats can have inconsistent formatting that results in improper parsing of the transcript models. For STAR SJ.out.tab files. Combine multiple SJ files into one, and apply optional filters. This utility can be used to create a high-confidence SJDB file from first-pass-aligned reads to improve sensitivity of novel splice junctions during second-pass alignment. SJ.bed Path to write output BED file Some aligners may ask for an SJDB file in BED format, like minimap2. This utility converts the STAR SJ.out.tab format to a standard 5-column BED file.